ABSTRACT

The purpose of this study was to compare certain
characteristics of multiple-choice (MC) and complex multiple-choice
(CMC) achievement tests designed to measure knowledge in
medical-surgical nursing. Each of 268 junior and senior nursing
students from four midwestern schools responded to one of four test
forms. MC items were developed by converting original CMC items with
four different systematic procedures. Results showed that: (1)
students responded to five MC items for every four CMC items, (2) MC
tests were at least as reliable as CMC tests though they did not
measure exactly the same traits, and (3) CMC tests were at least as
difficult as MC tests. Recommendations were made for test users.
(Author)
COMPARATIVE RELIABILITIES AND VALIDITIES
OF MULTIPLE CHOICE AND COMPLEX MULTIPLE CHOICE
NURSING EDUCATION TESTS

The purposes of this study were to compare the reliabilities of multiple
choice (MC) and complex multiple choice (CMC) achievement tests and to
determine the concurrent validities of MC tests that were written to measure
understandings of concepts and relationships in medical-surgical nursing.
CMC items consist of a stem, a list of alternative responses called primary
choices, and a list of responses called secondary choices, each of which is a
combination of the primary choices. Students select their response for a
CMC item from the list of secondary choices, only one of which is correct.
The CMC format is illustrated in Figure 1 by Item 1A.



1A. Which of the following are frequent side effects
    of oral contraceptives?

    a. Nausea
    b. Dizziness
    c. Headache
    d. Weight gain
    e. Breast discomfort

    1. a and b
    2. c and d
    3. All but e
    4. All the above

1B. Which of the following are frequent side effects
    of oral contraceptives?

    1. Nausea and dizziness
    2. Headache and weight gain
    3. Dizziness and headache
    4. All the above

Figure 1. Sample Complex Multiple-Choice Item
Converted to Multiple-Choice Format

The major questions formulated as research hypotheses were:

1. Are MC and CMC achievement tests that were designed to measure the same
objectives equally reliable?

2. What is the ratio of the number of MC items attempted to the number of CMC
items attempted by a group of examinees in a fixed period of time?

3. Is the correlation between individuals' MC and CMC subtest scores perfect
(+1.00) when corrected for attenuation?

4. Are MC tests derived from CMC tests equally difficult?

Method 

The CMC items used in this study were similar to those published to assist
student nurses in reviewing for state licensure examinations and to provide
guidance for nursing instructors in preparing classroom achievement tests.
Sixty-four four-choice CMC items designed to measure knowledge, comprehension,
and application in medical-surgical nursing were identified for test development
purposes. The keyed secondary choice for a CMC item could consist of one, two,
three, or all four primary choices. This relationship yielded four systematic
procedures for converting CMC items to MC form. The 64 original CMC items were
randomly split into two subtests, called C1 and C2, and were converted to MC
subtests, called M1 and M2, respectively. Forms M1 and M2 each comprised
eight items converted by each of the four procedures. The four final test
forms, C1M2, C2M1, M1C2, and M2C1, contained 16 items of each of the four types,
and neither the CMC nor the MC subtest consistently preceded the other.¹



¹Details of the item conversion procedures are in Dryden, 1974.
The subjects selected for testing were 212 junior and 56 senior nursing
students at four midwestern schools of nursing. Three of the schools were
hospital-affiliated and offered a diploma program. The fourth institution
was an urban university with a baccalaureate degree program. Students were
not randomly selected; all available students at these schools who were
willing to participate were used. There is no reason to suspect that the
group of subjects is vastly different from students in similar programs at
other institutions.

Procedures 

The study was designed to control various sources of random and systematic 
error. Each subject responded to only one test form and the four forms were
randomly distributed in groups within each school. Subtest orders were 
counterbalanced. Explicit directions were read for each test administration 
and a stopwatch was used for timing the first 10 minutes of testing. 

Subjects were stopped after 10 minutes of testing and were instructed 
to circle the number of the item they had been working on. Random marking
of answer sheets was not observed and each subject was able to complete the
examination.

Results 

The ratio of the number of MC to CMC items that subjects attempted in
the first 10 minutes of testing was determined to be 1.25. The median
number attempted was 23.33 and 18.61 for MC and CMC, respectively.
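
The reported ratio follows directly from the two medians. A minimal sketch of the arithmetic, using only the figures given above (the function name is illustrative and not part of the study):

```python
def attempt_ratio(mc_attempted: float, cmc_attempted: float) -> float:
    """Return the number of MC items attempted per CMC item attempted."""
    return mc_attempted / cmc_attempted


# Median number of items attempted in the first 10 minutes, as reported.
ratio = attempt_ratio(23.33, 18.61)
print(round(ratio, 2))  # 1.25
```

A ratio of 1.25 is the same as the 5:4 rate referred to elsewhere in the paper.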

Kuder-Richardson Formula 20 reliability coefficients computed for each
of the eight subtests are reported in Table 1. The reliabilities of the MC
subtests were adjusted with the Spearman-Brown formula (n = 1.25) to equate
testing time. Each of the four adjusted MC reliability coefficients was
larger than the corresponding CMC reliability coefficient. The differences
were tested for statistical significance by computing 90 percent confidence
intervals using a method developed by Feldt (1965). Table 2 is a display of
the upper and lower bounds of the confidence intervals. In the two pairs of
intervals which did not overlap, the MC reliability was higher than the CMC
reliability.
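
The paper does not print the Kuder-Richardson formula. The sketch below uses the standard textbook form, KR-20 = [k/(k-1)][1 - sum(pq)/variance of total scores], and assumes the population variance (dividing by N) is used. The function name and the toy response matrix are hypothetical and are not data from the study.

```python
from typing import Sequence


def kr20(responses: Sequence[Sequence[int]]) -> float:
    """KR-20 for a matrix of 0/1 item scores (rows = examinees, columns = items)."""
    n_people = len(responses)
    k = len(responses[0])

    # Variance of total scores (population form, an assumption of this sketch).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_people
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses of four examinees to three items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(toy))  # 0.75
```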



TABLE 1

K-R20 Reliabilities for Final Subtest Forms

                                   Subtest
              -------------------------------------------------
                  Complex               Multiple-choice
Test Form     Multiple-choice        Original      Adjusted

C1M2               .5991              .5692         .6228
M1C2               .3257              .3376         .3892
C2M1               .1378              .3328         .3840
M2C1               .3680              .5431         .5977
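
The adjusted MC reliabilities in Table 1 can be approximated from the original values with the Spearman-Brown formula, r' = nr / [1 + (n - 1)r], with n = 1.25. The following is a minimal sketch; the function and variable names are illustrative, and the results agree with the tabled values only to about three decimal places, presumably because of rounding.

```python
def spearman_brown(reliability: float, n: float) -> float:
    """Projected reliability when test length is multiplied by a factor n."""
    return n * reliability / (1 + (n - 1) * reliability)


# Original MC subtest reliabilities from Table 1, keyed by test form.
original_mc = {"C1M2": 0.5692, "M1C2": 0.3376, "C2M1": 0.3328, "M2C1": 0.5431}

for form, r in original_mc.items():
    print(form, round(spearman_brown(r, 1.25), 4))
```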



TABLE 2

Ninety Percent Confidence Intervals
for K-R20 Reliability Coefficients

                            Upper       Lower
Subtest     Test Form       Limit       Limit

C1          C1M2            .7037       .4828
M1          M1C2            .5486       .2121

C1          M2C1            .5329       .1829
M1          C2M1            .5448       .2054

C2          C2M1            .3998      -.0477
M2          M2C1            .7027       .4810*

C2          M1C2            .5017       .1302
M2          C1M2            .7212       .5134*

*Indicates the comparisons in which the intervals
did not overlap.
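
The overlap comparison can be checked mechanically. The sketch below assumes that consecutive rows of Table 2 form the CMC-MC pairs being compared (a reading of the table layout, not something the text states explicitly); the helper name is illustrative.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lower limit, upper limit)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Each entry pairs a CMC interval with an MC interval from Table 2.
# The pairing of rows is an assumption of this sketch.
comparisons: Dict[str, Tuple[Interval, Interval]] = {
    "C1 (C1M2) vs M1 (M1C2)": ((0.4828, 0.7037), (0.2121, 0.5486)),
    "C1 (M2C1) vs M1 (C2M1)": ((0.1829, 0.5329), (0.2054, 0.5448)),
    "C2 (C2M1) vs M2 (M2C1)": ((-0.0477, 0.3998), (0.4810, 0.7027)),
    "C2 (M1C2) vs M2 (C1M2)": ((0.1302, 0.5017), (0.5134, 0.7212)),
}

non_overlapping: List[str] = [
    name for name, (cmc, mc) in comparisons.items() if not overlaps(cmc, mc)
]
print(non_overlapping)
```

Under this pairing, only the two comparisons involving C2 and M2 fail to overlap, which matches the asterisked rows.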
Since each subject received a MC and a CMC subtest score, a Pearson
product-moment correlation was computed between subtest scores on each of
the four test forms. Each correlation was adjusted for unreliability by
correcting for attenuation. The original and corrected correlations are
reported in Table 3.



TABLE 3

Correlation Coefficients for Multiple-choice
and Complex Multiple-choice Subtest Scores
on Each Final Test Form

Test Form       r_mc        r_∞∞ (a)       n

M1C2            .193          .582         67
M2C1            .423          .946         67
C1M2            .592         1.014         68
C2M1            .392         1.569         66

(a) Disattenuated correlation
coefficients.
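
The correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. The sketch below assumes the CMC reliability and the original (unadjusted) MC reliability from Table 1 are the inputs; the paper does not say which MC value was used. The example is limited to the three forms for which these inputs reproduce the tabled coefficients to three decimals, and the function name is illustrative.

```python
import math


def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correlation corrected for unreliability in both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


# (observed r, CMC reliability, original MC reliability) per test form,
# taken from Tables 1 and 3.
forms = {
    "M1C2": (0.193, 0.3257, 0.3376),
    "M2C1": (0.423, 0.3680, 0.5431),
    "C1M2": (0.592, 0.5991, 0.5692),
}

for form, (r, rel_cmc, rel_mc) in forms.items():
    print(form, round(disattenuate(r, rel_cmc, rel_mc), 3))
```

Because sample reliabilities are themselves estimates, a corrected coefficient can exceed 1.00, as it does for form C1M2.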



Ninety percent confidence intervals for the disattenuated coefficients
were computed using a method developed by Forsyth and Feldt (1969). The
upper and lower limits are given in Table 4. The hypothesis that the
disattenuated correlations do not differ from unity was not supported in any
of the four cases.



TABLE 4

Ninety Percent Confidence Intervals for
Disattenuated Correlation Coefficients

                           Est. Standard      Upper       Lower
Test Form       r_∞∞           Error          Limit       Limit

M1C2            .582           .0490          1.0977      .9023
M2C1            .946           .0196          1.0397      .9603
C1M2           1.014           .0038          1.0117      .9883
C2M1           1.569           .1234          1.2461      .7539

A one-tailed t test was applied to test the differences in means on
subtests which contained different but corresponding items. Means and
standard deviations are shown in Table 5. The difference between the mean
number correct on subtests M1 and C1 was not significant (t = .401, df = 266,
p > .05). However, the difference between subtests M2 and C2 was significant
(t = 3.02, df = 266, p < .05).



TABLE 5

Subtest Means and Standard Deviations

                                         Standard
Test            Subtest       Mean       Deviation       N

C1M2, M2C1        C1          16.15        3.46         135
C2M1, M1C2        C2          16.31        3.03         133
M1C2, C2M1        M1          16.31        2.90         133
M2C1, C1M2        M2          17.56        3.77         135
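
The t tests can be approximated from the summary statistics in Table 5. The sketch below assumes an independent-samples, pooled-variance t test, which is consistent with the reported 266 degrees of freedom but is not named in the paper. It also assumes the rows of Table 5 are, in order, C1, C2, M1, and M2. Because the tabled means and standard deviations are rounded, the results come out close to, though not identical with, the t values reported in the text.

```python
import math
from typing import Tuple


def pooled_t(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> Tuple[float, int]:
    """Independent-samples t statistic (pooled variance) and its df."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# M1 versus C1, then M2 versus C2, using the values in Table 5.
t_m1_c1, df_1 = pooled_t(16.31, 2.90, 133, 16.15, 3.46, 135)
t_m2_c2, df_2 = pooled_t(17.56, 3.77, 135, 16.31, 3.03, 133)

print(round(t_m1_c1, 2), df_1)
print(round(t_m2_c2, 2), df_2)
```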
Discussion

Conclusions drawn from the findings of this study should be regarded as
tentative pending a replication of the study. The authors are not aware of
other research reported regarding the questions studied here.

The results suggested that students can attempt five MC items in the
time required to try four CMC items. In a 40-minute testing session, therefore,
93 MC or 74 CMC items might be used if the relative responding rates of
examinees remain 5:4 beyond the first 10 minutes of testing. This would imply
that a MC test is likely to sample the content domain better than a CMC test
when a given amount of testing time is available. The reliability evidence also
indicated that the longer test is more reliable.
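
The 40-minute projection is a linear extrapolation of the 10-minute medians reported in the Results. A minimal sketch under that constant-rate assumption (the function name is illustrative):

```python
def projected_items(items_per_10_min: float, session_minutes: float) -> int:
    """Extrapolate items attempted, assuming a constant response rate."""
    return round(items_per_10_min * session_minutes / 10)


# Medians for the first 10 minutes: 23.33 (MC) and 18.61 (CMC).
print(projected_items(23.33, 40), projected_items(18.61, 40))  # 93 74
```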

The fact that the MC and CMC reliabilities differed significantly in
only two cases indicates that some factor other than item format was affecting
the reliabilities. One factor that probably influenced the reliabilities of
the original CMC subtests was the difficulty level of the items. The mean
item difficulties (percent of the group responding incorrectly) on the four
original CMC subtests were 48, 50, 51, and 48. These averages are too high
for obtaining maximum reliability. If the item difficulties had averaged
about 37.5, the items might have been higher in discrimination and, therefore,
made for a more reliable test.
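
The paper does not say how the figure of 37.5 was obtained. It is consistent with a common rule of thumb that places the ideal percent correct midway between the chance level and a perfect score. The sketch below applies that rule, which is an assumption here rather than the authors' stated method; the function name is illustrative.

```python
def midpoint_difficulty_pct_incorrect(n_choices: int) -> float:
    """Percent incorrect when percent correct is midway between chance and 100."""
    chance_correct = 100.0 / n_choices           # 25.0 for four-choice items
    ideal_correct = (100.0 + chance_correct) / 2
    return 100.0 - ideal_correct


print(midpoint_difficulty_pct_incorrect(4))  # 37.5
```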

The MC-CMC subtest correlations were less than perfect. Though two of
the disattenuated correlation coefficients were "close" for practical
purposes, further research is needed before educational import can be attached
to this finding. Though the converted items were similar to the original
CMC items in content, the MC and CMC subtests on a given form were not made up
of corresponding items. There also was a problem with the reliabilities of the
M1 and C2 subtests; apparently the quality of the original CMC items was
insufficient. Research on other item format comparisons (Frisbie 1973,
Frisbie 1974) supports the notion that slightly different skills may be
required of the examinee when item format varies. Further research is
necessary before the extent of these differences and the specificity of the
skills can be identified. The question of what is measured when a particular
item format is employed certainly has a bearing on test validity in
achievement testing situations.

Theoretically-derived chance scores on the MC and CMC tests used here
were identical; subtest lengths and number of alternatives per item were the
same. The conflicting results obtained when test difficulties were compared
may have been produced by the relatively high item difficulties. Subjects
could not answer many of the items correctly no matter which format the items
were in. The findings regarding relative difficulties were at best inconclusive.

The results of this study suggest that more research in this area needs
to be done if any sound conclusions are to be reached. A study comparing
these two item formats, but using original CMC items of better quality than
those used in this study, may yield more conclusive results. Factors present
in the original items may have been the source of the difficulties in this
study. Future studies might also include a valid external criterion in an
attempt to clarify the validity question. If CMC and MC items do not measure
the same skills and knowledge, which of the two is a better measure of the
traits intended to be measured? Response rate with different item formats
also merits further study. The data reported here reflect rate of response
during the initial ten minutes of testing. The assumption has been made that
this rate remains constant throughout the remainder of the testing period.
The assumption actually represents an empirical question which should be
addressed because it relates to projected test length and size of adjustment
of the reliability estimate when testing time is held constant.
